Abstract. We study the stochastic dynamics of sequences evolving by single site
mutations, segmental duplications, deletions, and random insertions. These processes
are relevant for the evolution of genomic DNA. They define a universality class of
non-equilibrium 1D expansion-randomization systems with generic stationary long-range
correlations in a regime of growing sequence length. We obtain explicitly the
two-point correlation function of the sequence composition and the distribution function
of the composition bias in sequences of finite length. The characteristic exponent
$\chi$ of these quantities is determined by the ratio of two effective rates, which are
explicitly calculated for several specific sequence evolution dynamics of the universality
class. Depending on the value of $\chi$, we find two different scaling regimes, which are
distinguished by the detectability of the initial composition bias. All analytic results
are accurately verified by numerical simulations. We also discuss the non-stationary
build-up and decay of correlations, as well as more complex evolutionary scenarios,
where the rates of the processes vary in time. Our findings provide a possible example
for the emergence of universality in molecular biology.

1. Introduction

Universality classes with long-range correlations are a hallmark of systems with many 
degrees of freedom throughout physics. In equilibrium condensed matter systems, they 
mark critical points or phases with a particular symmetry. Out of equilibrium,
power-law correlations are more generic, but the classification of universality classes becomes
more difficult. Well-known examples are surface growth, reaction-diffusion systems, and
self-organized criticality.

A striking example of long-range correlations in a biological system was found
in the base pair composition of genomic DNA more than a decade ago [?]. Since
then, the composition correlations of DNA have been studied extensively by a variety
of different methods, and it is now well established that long-range correlations
appear in the genomes of many species [?]. The form of these correlations,
however, is much more complex than a simple power law. Within one chromosome, there
is often a variety of different scaling regimes and effective exponents, and sometimes no
clear scaling at all.

Despite the ubiquity of long-range correlations in genomes, little is known about
their origin. A likely dynamical scenario is that they are generated by the stochastic
processes of molecular sequence evolution. In [11], we studied a minimal
evolutionary dynamics producing long-range correlations that can be compared to
DNA sequence data in a quantitative way. This dynamics incorporates local processes
including single site mutations, duplications and deletions of existing segments of the
sequence, and insertions of random segments. It is inspired by a model introduced by
Li in 1989 [?]. We proved the emergence of long-range correlations in this
dynamics: the correlation function of the generated sequences decays as $C(r) \propto r^{-\alpha}$ for
large $r$, and we obtained an exact expression for the decay exponent $\alpha$.

In the first part of this article (sections 3-5), we present a more detailed calculation
of the expectation value of the two-point correlation function and the finite-size
distribution function of the sequence composition bias. We show that these quantities
exhibit consistent scaling and that their functional forms are given mathematically as
solutions of simple differential equations. The resulting power-law behavior can be
expressed in terms of a single basic exponent $\chi$, the scaling dimension of the local
composition bias. This exponent is determined by just two effective parameters, which
are simple functions of the rates of the elementary processes. As a function of $\chi$, we find
two distinct scaling regimes. In the strong-correlation regime ($\chi < 1/2$), the ancestral
composition bias can be detected in arbitrarily long sequences; in the weak-correlation
regime ($\chi > 1/2$), this is possible only up to a characteristic sequence length.

In the second part of the article (sections 6 and 7), we analyze various
generalizations of the sequence evolution model introduced in [11] and demonstrate that
they form a consistent universality class of non-equilibrium processes with generic
long-range correlations. The generalizations are biased segmental insertions as well as mutations
with biased rates, which break the $Z_2$ symmetry of the original model. The purpose of
this generalization is two-fold. On the one hand, the extended model is biologically more
accurate, since there is strong evidence for the presence of GC-content biased segmental
insertion processes [13], as well as biased mutation rates [?] during evolutionary history.
Taking into account these processes proves crucial for practical data analysis. On the
other hand, the model conceptually delineates the essential ingredients of this
non-equilibrium universality class: long-range correlations emerge from the interplay of
processes producing correlations on short scales, exponential growth of sequence length,
and local randomization processes. The universal scaling behavior is distinguished from
the symmetry breaking caused by biased mutation processes. Furthermore, we generalize
the scaling picture to dynamical aspects of the build-up and decay of correlations in time.
We conclude with a discussion of the role of universality in a biological context.



2. Sequence evolution model 

The stochastic evolution model generates sequences $S = (s_1, \ldots, s_N)$ of variable length
$N(t)$. For simplicity, their letters are taken from a binary alphabet, $s_k = \pm 1$. The
elementary evolutionary steps are single site mutations, duplications and deletions of
existing sequence segments of arbitrary lengths, and insertion of random segments. In
fact, these processes are assumed to be the major local processes acting on genomic DNA
sequences during evolutionary history [13]. Formally, the dynamics of the processes can
be defined by

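$$
\begin{array}{llll}
(\ldots, s, \ldots) & \to & (\ldots, -s, \ldots) & \text{mutation rate } \mu \\
(\ldots, (s)_\ell, \ldots) & \to & (\ldots, (s)_\ell, (s)_\ell, \ldots) & \text{duplication rate } \delta_\ell \\
(\ldots, s, \ldots) & \to & (\ldots, s, (x)_\ell, \ldots) & \text{insertion rate } \gamma^+_\ell \\
(\ldots, (s)_\ell, \ldots) & \to & (\ldots, \ldots) & \text{deletion rate } \gamma^-_\ell,
\end{array}
\tag{1}
$$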

where $(s)_\ell$ denotes an existing sequence segment of length $\ell \geq 1$, and $(x)_\ell$ is a segment
of length $\ell$ with uniformly distributed random letters $x_i = \pm 1$. Note that by convention
we do not allow insertion of random segments prior to the first sequence element.
Duplication and insertion events introduce a new sequence segment next to an existing
one and shift all subsequent letters $\ell$ positions to the right, thereby increasing the
sequence length by $\ell$. Conversely, deletions shorten the length by $\ell$. We restrict
all processes to a maximum range $\ell_{\max}$, i.e., all rates $\delta_\ell$, $\gamma^+_\ell$, and $\gamma^-_\ell$ are zero for
$\ell > \ell_{\max}$. Repeatedly running the processes over a time $t$ produces a statistical ensemble
of sequences; the corresponding averages are denoted by $\langle \ldots \rangle(t)$. This ensemble is
characterized by the rates of the processes and by the initial sequence. If we focus on
scales much larger than $\ell_{\max}$, the statistical properties of the generated sequences
turn out to be determined by just two effective parameters, the asymptotic growth
rate $\lambda$ and the effective mutation rate $\mu_{\rm eff}$, defined by

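$$\lambda = \delta_{\rm eff} + \gamma^+_{\rm eff} - \gamma^-_{\rm eff}, \tag{2}$$

$$\mu_{\rm eff} = \mu + \tfrac{1}{2} \gamma^+_{\rm eff}. \tag{3}$$

Both are simple functions of the cumulative rates of the "microscopic" processes,

$$\delta_{\rm eff} = \sum_{\ell=1}^{\ell_{\max}} \ell\, \delta_\ell, \qquad \gamma^+_{\rm eff} = \sum_{\ell=1}^{\ell_{\max}} \ell\, \gamma^+_\ell, \qquad \text{and} \qquad \gamma^-_{\rm eff} = \sum_{\ell=1}^{\ell_{\max}} \ell\, \gamma^-_\ell. \tag{4}$$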

The implementation of a numerical simulation of this dynamics is described in section 6.4. We
use the simulations to verify the analytically derived results of the following sections.
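The two effective parameters can be evaluated directly from a table of microscopic rates. The following is a minimal sketch, assuming that equations (2)-(4) read $\lambda = \delta_{\rm eff} + \gamma^+_{\rm eff} - \gamma^-_{\rm eff}$ and $\mu_{\rm eff} = \mu + \gamma^+_{\rm eff}/2$, with each cumulative rate given by $\sum_\ell \ell \cdot \mathrm{rate}_\ell$. The class `Rates`, the function names, and the numerical values are illustrative choices and are not taken from the article.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Rates:
    """Microscopic rates; dictionary keys are segment lengths l = 1..l_max (hypothetical container)."""
    mu: float = 0.0
    delta: Dict[int, float] = field(default_factory=dict)        # duplications
    gamma_plus: Dict[int, float] = field(default_factory=dict)   # random insertions
    gamma_minus: Dict[int, float] = field(default_factory=dict)  # deletions


def cumulative(rate_by_length: Dict[int, float]) -> float:
    """Cumulative rate sum_l l * rate_l, the form assumed for equation (4)."""
    return sum(length * rate for length, rate in rate_by_length.items())


def effective_parameters(rates: Rates) -> Tuple[float, float]:
    """Return (lambda, mu_eff) in the form assumed for equations (2) and (3)."""
    gamma_plus_eff = cumulative(rates.gamma_plus)
    lam = cumulative(rates.delta) + gamma_plus_eff - cumulative(rates.gamma_minus)
    mu_eff = rates.mu + 0.5 * gamma_plus_eff
    return lam, mu_eff


# Illustrative rates: duplications of length 2, insertions of length 3, deletions of length 4.
example = Rates(mu=1.0, delta={2: 6.0}, gamma_plus={3: 2.0}, gamma_minus={4: 0.5})
growth_rate, mutation_rate = effective_parameters(example)
```

Storing the rates per segment length keeps the code close to the notation of (1); only the two returned numbers enter the large-scale results.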

3. Sequence growth and average composition 

3.1. Average sequence length 

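Running the processes defined in (1) on sequences will change their lengths $N(t)$. The
dynamics of $\langle N \rangle(t)$, averaged over an ensemble of sequences, is

$$\frac{d}{dt} \langle N \rangle(t) = \left[ \sum_{\ell=1}^{\ell_{\max}} \ell\, \gamma^+_\ell + \sum_{\ell=1}^{\ell_{\max}} \sigma_\ell\, \ell\, (\delta_\ell - \gamma^-_\ell) \right] \langle N \rangle(t). \tag{5}$$

The finite size correction factor $\sigma_\ell = 1 - (\ell - 1)/\langle N \rangle(t)$ accounts for the fact that in a
sequence of length $N(t)$ there are only $N(t) - \ell + 1$ possibilities to duplicate or delete
a segment of length $\ell$. Using the initial condition $N(t = 0) = N_0$, the solution of (5) in
the asymptotic regime, $\langle N \rangle(t) \gg \ell_{\max}$, is then given by

$$\langle N \rangle(t) = N_0 \exp(\lambda t) \tag{6}$$

with the asymptotic growth rate $\lambda$, as defined in (2).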

3.2. Average composition bias 

The average composition of a sequence element is measured by the expectation value
$\langle s_k \rangle(t)$, and in the following we show that any initial bias decays due to mutations
and random insertions. $\langle s_k \rangle(t)$ can be written as the difference

$$\langle s_k \rangle(t) = P_k^+(t) - P_k^-(t), \tag{7}$$

where $P_k^+(t)$ and $P_k^-(t)$ denote the probabilities of finding $s_k = +1$ or $s_k = -1$ at time $t$.
The Master equations for $P_1^\pm(t)$ of the first sequence site $s_1$ are given by

$$\frac{d}{dt} P_1^\pm(t) = \mu \left[ P_1^\mp - P_1^\pm \right] + \sum_{\ell=1}^{\ell_{\max}} \gamma^-_\ell \left[ P_{1+\ell}^\pm - P_1^\pm \right]. \tag{8}$$

Omitting deletion and starting with a single site $S(t = 0) = (+1)$, we obtain

$$\langle s_1 \rangle(t) = \exp(-2\mu t). \tag{9}$$

If one additionally allows deletion, any initial bias of $s_1$ will decay even faster.



Universality of Long-Range Correlations in Expansion- Randomization Systems 



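[Figure 1: schematic of the sites $s_1, \ldots, s_{k-1}, s_k, \ldots, s_{k+\ell}$ in $S(t)$ and of the sites $s_1, \ldots, s_k$ in $S(t + dt)$.]

Figure 1. Illustration of the different mechanisms contributing to $dP_k^\pm(t)/dt$.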



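Sequence sites at positions $k > 1$ are also affected by duplications and insertions,
and the Master equations for the probabilities $P_k^\pm(t)$ take the form

$$
\begin{aligned}
\frac{d}{dt} P_k^\pm(t) = {} & \mu \left[ P_k^\mp - P_k^\pm \right] + \sum_{\ell=1}^{\ell_{\max}} \gamma^+_\ell \min(k - 1, \ell) \left( 1/2 - P_k^\pm \right) \\
& + \sum_{\ell=1}^{k-2} (k - \ell - 1)\, \gamma^+_\ell \left[ P_{k-\ell}^\pm - P_k^\pm \right] + \sum_{\ell=1}^{k-1} (k - \ell)\, \delta_\ell \left[ P_{k-\ell}^\pm - P_k^\pm \right] \\
& + \sum_{\ell=1}^{\ell_{\max}} k\, \gamma^-_\ell \left[ P_{k+\ell}^\pm - P_k^\pm \right].
\end{aligned}
\tag{10}
$$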

The different mechanisms contributing to $dP_k^\pm(t)/dt$ are illustrated in figure 1. Any
bias at site $s_k$ is again diminished by single-site mutations, as specified by the first
term on the r.h.s. of (10), but also by insertions of random segments $(x_i, \ldots, x_{i+\ell-1})$
of length $\ell$ at positions $i = k - \ell + 1, \ldots, k$, which effectively randomize $s_k$ (second
term). Additionally, there is a "shift" of composition bias from preceding sequence
positions $s_{k-\ell}$ due to insertions of random segments $(x_i, \ldots, x_{i+\ell-1})$ of length $\ell$ at
positions $i = 2, \ldots, k - \ell$ (third term), or duplications of existing sequence segments
$(s_i, \ldots, s_{i+\ell-1})$ with $i = 1, \ldots, k - \ell$ (fourth term). Transport of bias from sites $s_{k+\ell}$ to
$s_k$, on the other hand, occurs due to deletion of existing segments $(s_i, \ldots, s_{i+\ell-1})$ with
$i = 1, \ldots, k$ (last term).

In order to reveal the large-distance asymptotics of this dynamics for $k \gg \ell_{\max}$
and in large sequences with $N(t) \gg \ell_{\max}$, we carry out a continuum limit of (10), i.e.,
we replace the discrete index $k$ by a continuous variable and write $\langle s(k, t) \rangle = \langle s_k \rangle(t)$.
Using (7), we obtain a differential equation describing the asymptotic dynamics,



$$\frac{\partial}{\partial t} \langle s(k, t) \rangle = -2 \mu_{\rm eff} \langle s(k, t) \rangle - \lambda k \frac{\partial}{\partial k} \langle s(k, t) \rangle, \tag{11}$$

with the asymptotic growth rate $\lambda$ and the effective mutation rate $\mu_{\rm eff}$ defined in (2)
and (3). The transport of composition bias due to the net exponential expansion of
the sequences is thereby incorporated in a dilatation operator of the functional form
$k\, \partial/\partial k$; all finite size effects vanish in this regime. Equation (11) has a solution of the
form

$$\langle s(k, t) \rangle = e^{-2 \mu_{\rm eff} t}\, \mathcal{S}(k e^{-\lambda t}), \tag{12}$$

where $\mathcal{S}(x)$ is a scaling function. This solution describes two different regimes of the
expectation value, depending on the boundary condition chosen. (a) With fixed initial
condition $s_1(t = 0) = 1$, we have for any fixed $k$

$$\langle s(k, t) \rangle \propto \exp(-2 \mu_{\rm eff} t), \tag{13}$$

as shown in figure 2(a) for different values of $k$ and a given set of process rates. Thus,
$\langle s(k, t) \rangle = 0$ for all $k$ in the limit $t \to \infty$. (b) With fixed boundary condition $\langle s_1 \rangle = +1$
for all $t$ (i.e., suppressing mutations of the first element), we obtain a power-law decay
of the composition bias along the sequence,

$$\langle s(k) \rangle \propto k^{-\chi} \quad \text{with} \quad \chi = \frac{2 \mu_{\rm eff}}{\lambda}. \tag{14}$$

Numerical verification of the asymptotics (14) for this type of dynamics is presented in
figure 2(b), where we show the measured $\langle s_k \rangle$ in ensembles of sequences with different
sets of rates using the simulation algorithm described in section 6.4.

Figure 2. Average composition bias $\langle s_k \rangle(t)$, plotted against time $t$ in (a) and against position $k$ in (b).
(a) Decay of $\langle s_k \rangle(t)$ in time for $k = 1, 2, 5, 10$. Rates of the processes are
$\mu = 1.0$, $\delta_1 = 4.0$, $\gamma^+_{?} = 0.2$, $\gamma^-_{?} = 0.5$.
The red line is the analytic lower bound on the rate of convergence (13).
(b) Stationary $\langle s_k \rangle$ with fixed $\langle s_1 \rangle = +1$ at different rates of the elementary
processes: (1) $\mu = 1.0$, $\delta_3 = 15.0$, $\gamma^+_2 = 1.0$, $\gamma^-_{?} = 1.0$;
(2) $\mu = 1.0$, $\delta_{?} = 16.0$, $\gamma^+_2 = 1.0$, $\gamma^-_1 = 2.0$;
(3) $\mu = 1.0$, $\delta_2 = 6.0$, $\gamma^+_3 = 2.0$, $\gamma^-_4 = 0.5$;
(4) $\mu = 1.0$, $\delta_1 = 4.0$, $\gamma^+_{?} = 1.0$, $\gamma^-_4 = 0.5$.
Red lines denote the corresponding analytic asymptotics (14). All ensemble averages
were obtained by averaging over $10^{?}$ simulated sequences.
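The step from the scaling solution to the power law can be made explicit. The following short derivation is a sketch that assumes the continuum equation (11) in the form $\partial_t \langle s \rangle = -2\mu_{\rm eff} \langle s \rangle - \lambda k\, \partial_k \langle s \rangle$; it uses no input beyond that equation and the scaling form (12).

```latex
% Stationary state of (11): the time derivative vanishes.
\begin{align*}
0 &= -2\mu_{\rm eff}\,\langle s(k)\rangle - \lambda k \frac{d}{dk}\langle s(k)\rangle
\quad\Longrightarrow\quad
\frac{d \ln \langle s(k)\rangle}{d \ln k} = -\frac{2\mu_{\rm eff}}{\lambda} = -\chi, \\
\langle s(k)\rangle &= \langle s(1)\rangle\, k^{-\chi}.
\end{align*}
% Consistency with (12): the choice S(x) = x^{-\chi} gives
%   e^{-2\mu_{\rm eff} t} (k e^{-\lambda t})^{-\chi} = k^{-\chi} e^{(\lambda\chi - 2\mu_{\rm eff}) t} = k^{-\chi},
% which is independent of t precisely because \lambda\chi = 2\mu_{\rm eff}.
```

The exponent is thus fixed by the balance between dissipation (rate $2\mu_{\rm eff}$) and dilatation (rate $\lambda$), independently of the normalization set by the boundary condition.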

4. Stationary two-point correlations

4.1. Master equation

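The dynamics of the composition correlation function $C(k, r, t) = \langle s_k s_{k+r} \rangle(t)$ between
two sequence positions $s_k$ and $s_{k+r}$ can be derived by writing it as

$$C(k, r, t) = P_{\rm eq}(k, r, t) - P_{\rm op}(k, r, t), \tag{15}$$

where $P_{\rm eq/op}(k, r, t)$ denote the joint probabilities of simultaneously finding two equal
or opposite symbols, respectively, at sequence positions $k$ and $k + r$ and time $t$. For
simplicity, we start with a restricted sequence evolution model where all processes are
limited to single sequence sites ($\ell_{\max} = 1$). The Master equation for $P_{\rm eq}(k, r, t)$ in the
single-site model takes the form

$$
\begin{aligned}
\frac{d}{dt} P_{\rm eq}(k, r, t) = {} & 2\mu \left[ P_{\rm op}(k, r) - P_{\rm eq}(k, r) \right] && \text{(16a)} \\
& + \tfrac{1}{2} \gamma^+_1 \left[ P_{\rm op}(k, r) - P_{\rm eq}(k, r) \right] && \text{(16b)} \\
& + \tfrac{1}{2} \gamma^+_1 \left[ P_{\rm op}(k - 1, r) - P_{\rm eq}(k - 1, r) \right] && \text{(16c)} \\
& + \tfrac{1}{2} \gamma^+_1 \left[ P_{\rm eq}(k - 1, r) - P_{\rm eq}(k, r) \right] && \text{(16d)} \\
& + \left[ (r - 1) \gamma^+_1 + r \delta_1 \right] \left[ P_{\rm eq}(k, r - 1) - P_{\rm eq}(k, r) \right] && \text{(16e)} \\
& + r \gamma^-_1 \left[ P_{\rm eq}(k, r + 1) - P_{\rm eq}(k, r) \right] && \text{(16f)} \\
& + \left[ (k - 2) \gamma^+_1 + (k - 1) \delta_1 \right] \left[ P_{\rm eq}(k - 1, r) - P_{\rm eq}(k, r) \right] && \text{(16g)} \\
& + k \gamma^-_1 \left[ P_{\rm eq}(k + 1, r) - P_{\rm eq}(k, r) \right]. && \text{(16h)}
\end{aligned}
$$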

The different mechanisms contributing to $dP_{\rm eq}(k, r, t)/dt$ are illustrated in figure 3 and
will now be discussed in order. (16a) describes the change in $P_{\rm eq}(k, r, t)$ due to mutation
of either of the two sites (therefore two possibilities) in a pair of equal or opposite symbols
at positions $k$ and $k + r$. (16b) treats the insertion of a random site at position
$k + r$, which in half of the cases will switch a pair of equal symbols $s_k = s_{k+r}$ to
opposite symbols $s_k = -s_{k+r}$, while two opposite symbols might be switched to equal
symbols, accordingly. A similar contribution arises from a random insertion at position
$k$. However, such an event can be regarded as a duplication of $s_{k-1}$ followed, in half of
the cases, by a mutation of the newly introduced element $s_k$. If such a mutation
occurs, the event is equivalent to (16b), with the difference that the contributions of this
process to $dP_{\rm eq}(k, r, t)/dt$ now depend on the joint probabilities $P_{\rm eq/op}(k - 1, r, t)$
(16c). In the other half of the cases, where the newly inserted random element is
equal to $s_{k-1}$, the process causes a shift of joint probability from $P_{\rm eq}(k - 1, r, t)$ to
$P_{\rm eq}(k, r, t)$ (16d). Transport of joint probability from distance $r - 1$ to distance $r$
takes place if a random site is inserted at sequence positions $k + 1, \ldots, k + r - 1$, or if any
site $s_k, \ldots, s_{k+r-1}$ is duplicated (16e). On the other hand, deletion of any $s_{k+1}, \ldots, s_{k+r}$
produces a transport of joint probability from distance $r + 1$ to $r$ (16f). Besides this
"expansion" and "contraction" transport of joint probability from distances $r + 1$ or
$r - 1$ to $r$ at fixed $k$, there is also a "horizontal" shift along the sequence: insertion
of a random site at positions $2, \ldots, k - 1$ or duplication of any site $s_1, \ldots, s_{k-1}$ shifts
joint probability $P_{\rm eq}(k - 1, r, t)$ to $P_{\rm eq}(k, r, t)$ (16g), while deletion of any of $s_1, \ldots, s_k$ shifts
$P_{\rm eq}(k + 1, r, t)$ to $P_{\rm eq}(k, r, t)$ (16h).

[Figure 3: schematic of the sites $s_1, \ldots, s_{k-1}, s_k, s_{k+1}, \ldots, s_{k+r-1}, s_{k+r}, s_{k+r+1}$ in $S(t)$ and of the sites $s_k$ and $s_{k+r}$ in $S(t + dt)$.]

Figure 3. Illustration of the different mechanisms contributing to the dynamics
of $P_{\rm eq}(k, r, t)$. Effectively mutational events are those that randomize either
$s_k$ or $s_{k+r}$. "Expansion" or "contraction" transport of joint probability from
$P_{\rm eq}(k, r \pm 1)$ to $P_{\rm eq}(k, r)$ occurs due to duplication, insertion, or deletion events
at sequence positions between $s_k$ and $s_{k+r}$. "Horizontal" shift from $P_{\rm eq}(k \pm 1, r)$
to $P_{\rm eq}(k, r)$ takes place if a duplication, insertion, or deletion occurs at sequence
positions prior to $s_k$.

Since we are interested in a stationary solution of this dynamics, we have to
consider the limit $t \to \infty$. It has already been shown in section 3.2 that asymptotically
$\langle s_k \rangle(t) \to 0$ for large $t$ at all $k$. Furthermore, all processes act homogeneously
along the sequence, and therefore we expect the joint probabilities also to be independent
of $k$ in the long-time limit, i.e., $P_{\rm eq/op}(k, r) = P_{\rm eq/op}(k \pm 1, r)$ (verification is given by
our numerical simulations). The dynamics (16) then simplifies to

$$
\begin{aligned}
\frac{d}{dt} P_{\rm eq}(r, t) = {} & (2\mu + \gamma^+_1) \left[ P_{\rm op}(r) - P_{\rm eq}(r) \right] \\
& + \left[ (r - 1) \gamma^+_1 + r \delta_1 \right] \left[ P_{\rm eq}(r - 1) - P_{\rm eq}(r) \right] \\
& + r \gamma^-_1 \left[ P_{\rm eq}(r + 1) - P_{\rm eq}(r) \right].
\end{aligned}
\tag{17}
$$

By exchanging $P_{\rm eq}$ and $P_{\rm op}$, we can state an equivalent equation for $P_{\rm op}(r, t)$. Using (15),
we obtain the dynamics of the correlation function $C(r, t)$ for large $t$,

$$
\begin{aligned}
\frac{d}{dt} C(r, t) = {} & -(4\mu + 2\gamma^+_1)\, C(r) \\
& + \left[ (r - 1) \gamma^+_1 + r \delta_1 \right] \left[ C(r - 1) - C(r) \right] \\
& + r \gamma^-_1 \left[ C(r + 1) - C(r) \right].
\end{aligned}
\tag{18}
$$
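The passage from the equation for $P_{\rm eq}$ to the equation for $C$ is a subtraction of two equations. The sketch below spells it out, assuming (17) has the three-term form (mutational term, transport from $r - 1$, transport from $r + 1$) and using only the definition $C = P_{\rm eq} - P_{\rm op}$ from (15).

```latex
% Subtract the P_op equation (obtained from (17) by exchanging eq and op) from (17).
\begin{align*}
\frac{d}{dt}\left[P_{\rm eq}(r) - P_{\rm op}(r)\right]
 &= 2\,(2\mu + \gamma^+_1)\left[P_{\rm op}(r) - P_{\rm eq}(r)\right] \\
 &\quad + \left[(r-1)\gamma^+_1 + r\delta_1\right]
      \Big\{\left[P_{\rm eq}(r-1) - P_{\rm op}(r-1)\right] - \left[P_{\rm eq}(r) - P_{\rm op}(r)\right]\Big\} \\
 &\quad + r\gamma^-_1
      \Big\{\left[P_{\rm eq}(r+1) - P_{\rm op}(r+1)\right] - \left[P_{\rm eq}(r) - P_{\rm op}(r)\right]\Big\}, \\
\frac{d}{dt} C(r)
 &= -(4\mu + 2\gamma^+_1)\, C(r)
    + \left[(r-1)\gamma^+_1 + r\delta_1\right]\left[C(r-1) - C(r)\right]
    + r\gamma^-_1\left[C(r+1) - C(r)\right].
\end{align*}
```

The factor of two in the first term arises because every conversion of an equal pair into an opposite pair lowers $P_{\rm eq}$ and raises $P_{\rm op}$ by the same amount; the prefactor $4\mu + 2\gamma^+_1$ equals $4\mu_{\rm eff}$ for $\ell_{\max} = 1$.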

This equation for the dynamics of $C(r, t)$ in the single-letter model ($\ell_{\max} = 1$) is valid
for all distances $r$ in the limit $t \to \infty$. A corresponding dynamics can, in principle,
be obtained analogously for the general model with $\ell_{\max} > 1$, although it will be more
complicated due to finite size effects coming into play for $r < \ell_{\max}$. However, for
large distances $r \gg \ell_{\max}$, these finite size effects can be neglected, and the asymptotic
dynamics of $C(r, t)$ in the general segmental model is then given by

$$
\begin{aligned}
\frac{d}{dt} C(r, t) = {} & -4 \mu_{\rm eff}\, C(r) \\
& + \sum_{\ell=1}^{\ell_{\max}} \left[ (r - \ell) \gamma^+_\ell + (r - \ell + 1) \delta_\ell \right] \left[ C(r - \ell) - C(r) \right] \\
& + \sum_{\ell=1}^{\ell_{\max}} r \gamma^-_\ell \left[ C(r + \ell) - C(r) \right],
\end{aligned}
\tag{19}
$$

with the effective mutation rate $\mu_{\rm eff}$, as defined in (3). Note that the dynamics (18) of
the single-letter model is a special case of the general dynamics (19) with $\ell_{\max} = 1$.

4.2. Stationary solutions

In the following, we derive analytic solutions for the stationary correlations $C(r)$ in
our model. We start with the special case of only single-site duplications and mutations
($\mu, \delta_1 > 0$, all other rates are zero). In this case, the solution of the dynamics (18) in
the stationary state, $dC(r, t)/dt = 0$, obeys the recursion equation

$$C(r) = \frac{r}{\alpha + r}\, C(r - 1) \quad \text{with} \quad \alpha = \frac{4\mu}{\delta_1}. \tag{20}$$

Using $C(0) = 1$, the recursion can easily be solved, yielding

$$C(r) = \prod_{n=1}^{r} \frac{n}{\alpha + n}. \tag{21}$$

Introducing the gamma function and the beta function, defined by

$$\Gamma(x) = \int_0^\infty e^{-t}\, t^{x-1}\, dt, \qquad B(x, y) = \frac{\Gamma(x)\, \Gamma(y)}{\Gamma(x + y)}, \tag{22}$$

$C(r)$ can finally be rewritten in the form

$$C(r) = \frac{\Gamma(r + 1)\, \Gamma(\alpha + 1)}{\Gamma(r + \alpha + 1)} = r\, B(r, \alpha + 1). \tag{23}$$

To investigate the asymptotic regime, we evaluate the asymptotic behavior of $B(r, \alpha)$
for $r \gg 1$ which, in general, is given by

$$B(r, \alpha) \propto \Gamma(\alpha)\, r^{-\alpha} \left[ 1 - \frac{\alpha (\alpha - 1)}{2r} \left( 1 + O\!\left( \frac{1}{r} \right) \right) \right]. \tag{24}$$

Applying this asymptotics to equation (23), we obtain

$$C(r) \propto r^{-\alpha}. \tag{25}$$

Hence, we have proven the existence of long-range correlations in the simplified
single-site duplication-mutation model. The exponent $\alpha$ is determined by a simple
balance between the randomization processes (mutations) and the expansion processes
(duplications), which create correlations between neighboring sites and transport these
correlations to larger distances due to an overall expansion of the system.
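The closed form and its power-law tail are easy to check numerically. The following minimal sketch assumes the recursion $C(r) = \frac{r}{\alpha + r} C(r - 1)$ with $C(0) = 1$ and compares the resulting product with the gamma-function expression $\Gamma(r + 1)\Gamma(\alpha + 1)/\Gamma(r + \alpha + 1)$; the function names are illustrative.

```python
import math


def c_product(r: int, alpha: float) -> float:
    """C(r) from the recursion C(r) = r / (alpha + r) * C(r - 1) with C(0) = 1."""
    c = 1.0
    for n in range(1, r + 1):
        c *= n / (alpha + n)
    return c


def c_gamma(r: int, alpha: float) -> float:
    """Closed form Gamma(r+1) Gamma(alpha+1) / Gamma(r+alpha+1), evaluated via log-gamma."""
    log_c = math.lgamma(r + 1) + math.lgamma(alpha + 1) - math.lgamma(r + alpha + 1)
    return math.exp(log_c)


def local_exponent(r: int, alpha: float) -> float:
    """Effective decay exponent -d ln C / d ln r, estimated between r and 2r."""
    return -math.log(c_gamma(2 * r, alpha) / c_gamma(r, alpha)) / math.log(2.0)


# Example with mu = 1 and delta_1 = 8, i.e. alpha = 4 mu / delta_1 = 0.5.
alpha_example = 4.0 * 1.0 / 8.0
exponent_at_large_r = local_exponent(10_000, alpha_example)
```

Working with `math.lgamma` avoids overflow of the gamma function at large $r$; the local exponent approaches $\alpha$ with corrections of relative order $1/r$.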

We have performed extensive Monte Carlo simulations of this model using the
algorithm presented in section 6.4. Figure 4(a) shows the numerical $C(r)$ for the
duplication-mutation dynamics with various rates $\delta_1$ and $\mu$, which is in excellent agreement with
the analytic expression (23).
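For orientation, the single-site duplication-mutation dynamics can be simulated in a few lines. The sketch below is not the algorithm referred to in the text, which is not reproduced here; it is a plain event-by-event simulation under the assumption that every site mutates at rate $\mu$ and is duplicated at rate $\delta_1$, so that the next event hits a uniformly chosen site and is a mutation with probability $\mu/(\mu + \delta_1)$. The function names and the stopping rule (a fixed target length) are choices made for this illustration.

```python
import random
from typing import List


def simulate(mu: float, delta: float, target_length: int, rng: random.Random) -> List[int]:
    """Grow one sequence from the single letter +1 until it has target_length sites."""
    seq = [1]
    p_mut = mu / (mu + delta)  # probability that the next event is a mutation
    while len(seq) < target_length:
        i = rng.randrange(len(seq))    # all sites carry the same total rate mu + delta
        if rng.random() < p_mut:
            seq[i] = -seq[i]           # single-site mutation
        else:
            seq.insert(i + 1, seq[i])  # duplicate site i directly after itself
    return seq


def correlation(seq: List[int], r: int) -> float:
    """Composition correlation C(r), averaged along a single sequence."""
    n = len(seq) - r
    return sum(seq[k] * seq[k + r] for k in range(n)) / n


rng = random.Random(0)
sequence = simulate(mu=1.0, delta=8.0, target_length=500, rng=rng)
c_values = [correlation(sequence, r) for r in (1, 2, 4, 8)]
```

Because `list.insert` costs time linear in the sequence length, this version is only suitable for short sequences; averaging `correlation` over many independent runs would be needed before comparing with the analytic form.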

For reasons of comparability with former studies [?], we also calculated power
spectra of the simulated sequences. In the stationary state, the power spectrum $P(f)$ is
the Fourier transform of the correlation function $C(r)$. In our case, the large distance
asymptotics of the correlation function is given by $C(r) \propto r^{-\alpha}$, and the power spectrum
will therefore also decay algebraically, i.e., $P(f) \propto 1/f^{\beta}$ with the exponent $\beta = 1 - \alpha$ [?].
The resulting data is shown in figure 4(b). Because $C(r) \propto r^{-\alpha}$ holds only
in the limit $r \gg 1$, the analytically estimated scaling $P(f) \propto 1/f^{\beta}$ is present
at lower frequencies, but crosses over to a different behavior at higher ones. At values
of $\alpha > 1$, $C(r)$ decays below the fluctuation threshold $\Delta C = 1/\sqrt{N(t)}$ [12] before
the scaling gets established, thus obviating the appearance of a positive exponent
$-\beta = \alpha - 1$ of $f$ in the power spectrum. In
those cases, we measure a flat power spectrum in the low frequency part, as one expects
for a random sequence. The finite size deviations of $C(r)$ at very large $r$ show up in the
very low frequency part of the power spectra, too.

Figure 4. Single-site duplication-mutation model, plotted against length $r$ in (a) and against
frequency $f = 1/r$ in (b). (a) Stationary composition
correlation $C(r)$ at different rates of the elementary processes; numerical results
(circles) and the analytic form (23) (lines) for $\mu = 1.0$, $\delta_1$ varying. $C(r)$
is averaged along the sequence. (b) Power spectra of simulated sequences
for $\mu = 1.0$ and $\delta_1$ varying: numerical results (circles) with the analytically
predicted $P(f) \propto 1/f^{\beta}$ in those cases where $\delta_1 > 5$ (lines). The dynamics of
the sequences was simulated until they reached a length of $N = 2^{?} \approx 10^{?}$. All
data sets were obtained by averaging over 100 runs.
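A power spectrum of a single $\pm 1$ sequence can be estimated with a periodogram. The sketch below uses a naive discrete Fourier transform, which costs $O(N^2)$ operations and is meant only to make the definition concrete; the normalization by $N$ and the function name are conventions chosen here, not taken from the article.

```python
import cmath
from typing import List, Sequence


def power_spectrum(seq: Sequence[int]) -> List[float]:
    """Periodogram |sum_k s_k exp(-2 pi i q k / N)|^2 / N at frequencies f = q / N, q = 0..N//2."""
    n = len(seq)
    spectrum = []
    for q in range(n // 2 + 1):
        amplitude = sum(s * cmath.exp(-2j * cmath.pi * q * k / n) for k, s in enumerate(seq))
        spectrum.append(abs(amplitude) ** 2 / n)
    return spectrum


# A strictly alternating sequence has all of its power at the highest frequency f = 1/2.
alternating = [1, -1] * 8
spectrum = power_spectrum(alternating)
```

For long simulated sequences, a fast Fourier transform and an average over independent runs would replace this direct sum.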

Obviously, one cannot expect the stationary $C(r)$ of the general model to be
described by an expression as simple as the one obtained for the single-site
duplication-mutation dynamics in (23). Consider, for example, a segmental duplication
process copying segments of length $\ell_1 = 50$. If this is the only duplication process
present, it will introduce a peak in $C(r)$ at a distance corresponding to its segment
length, $r = \ell_1$. If there is an additional duplication process present, e.g. one with
$\ell_2 = 1$, the peak in $C(r)$ established by the first duplication process will be shifted
to larger distances by the second process. The functional form of $C(r)$ will thus show
complex behavior on short scales, reflecting the "microscopic" details of the elementary
processes (see figure 5). But what about the large-distance asymptotics of $C(r)$ for
$r \gg \ell_{\max}$? In this regime, the dynamics of $C(r, t)$ is given by equation (19). Carrying
out a continuum limit, the difference equation (19) can again be written as a simple
differential equation,

$$\frac{\partial}{\partial t} C(r, t) = -4 \mu_{\rm eff}\, C(r, t) - \lambda r \frac{\partial}{\partial r} C(r, t). \tag{26}$$

The stationary solution of equation (26) immediately yields the power-law decay

$$C(r) \propto r^{-\alpha} \quad \text{with} \quad \alpha = 2\chi = \frac{4 \mu_{\rm eff}}{\lambda}. \tag{27}$$

Hence, on macroscopic distances $r \gg \ell_{\max}$ our model universally produces long-range
correlations in the sequences, irrespective of the microscopic details of the individual
processes. The decay exponent $\alpha$ depends on only two effective parameters, which
are simple functions of the rates of the processes. Using these analytic results, we
can furthermore qualitatively classify the four different types of processes according to
whether they increase $\alpha$ or decrease it. Duplications are the only processes with $d\alpha/d\delta_\ell < 0$, since
they raise the growth rate $\lambda$ but have no effectively mutational influence on large scales.
All other processes, in contrast, lead to larger values of $\alpha$ and thus to faster decaying
correlations along the sequence when their rates increase.

To verify these analytic results, we show the measured correlation functions $C(r)$
of simulated sequences with a variety of different processes present in figure 5. While on
short scales the correlations reveal the microscopic details of the particular processes,
in the asymptotic regime long-range correlations are ubiquitous. Their functional form
is accurately described by our analytic result (27) with the effective rates (2) and (3).

Figure 5. Stationary $C(r)$ as a function of the length $r$ at different rates of the elementary
processes for the general model with various segmental processes present: numerical results
(circles) with the analytic asymptotics (27) (lines) for $\mu = 1.0$ and varying rates
of the other processes (rates not specified in the plot are zero).
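The classification of the processes by their effect on $\alpha$ can be probed numerically. The sketch below assumes $\alpha = 4\mu_{\rm eff}/\lambda$ with $\lambda = \delta_{\rm eff} + \gamma^+_{\rm eff} - \gamma^-_{\rm eff}$ and $\mu_{\rm eff} = \mu + \gamma^+_{\rm eff}/2$, takes the cumulative rates directly as inputs, and estimates the derivatives by finite differences; all names and numbers are illustrative.

```python
from typing import Dict


def alpha(mu: float, delta: float, gamma_plus: float, gamma_minus: float) -> float:
    """Decay exponent alpha = 4 mu_eff / lambda for given cumulative rates."""
    lam = delta + gamma_plus - gamma_minus
    if lam <= 0:
        raise ValueError("the growth rate lambda must be positive")
    mu_eff = mu + 0.5 * gamma_plus
    return 4.0 * mu_eff / lam


def sensitivity(name: str, base: Dict[str, float], eps: float = 1e-6) -> float:
    """Forward finite difference of alpha with respect to one cumulative rate."""
    bumped = dict(base)
    bumped[name] += eps
    return (alpha(**bumped) - alpha(**base)) / eps


base_rates = {"mu": 1.0, "delta": 8.0, "gamma_plus": 1.0, "gamma_minus": 0.5}
signs = {name: sensitivity(name, base_rates) for name in base_rates}
```

For this parameter set only the duplication rate has a negative derivative. Within the assumed formula, the derivative with respect to the insertion rate equals $(2 - \alpha)/\lambda$, so its sign is positive as long as $\alpha < 2$.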

5. Finite-size distribution of the composition bias

Up to this point, we have discussed correlation functions, which are defined as averages
over an ensemble of sequences generated by the same stochastic dynamics. What can
we say about the data of a single sequence, i.e., a single realization of the stochastic
process? To address this question, we now consider the distribution of the composition
bias evaluated in finite sequence intervals $k, \ldots, k + L - 1$ of length $L$,

$$m = \frac{1}{L} \sum_{k'=k}^{k+L-1} s_{k'}. \tag{28}$$

Generalizing equations (11) and (26), we obtain the following differential equation for
the distribution function $P(m, L, t)$,

$$\frac{\partial}{\partial t} P(m, L, t) = -\lambda L \frac{\partial}{\partial L} P(m, L, t) + 2 \mu_{\rm eff} \frac{\partial}{\partial m} \left[ m P(m, L, t) \right] + \frac{2 \mu_{\rm eff}}{L} \frac{\partial^2}{\partial m^2} P(m, L, t), \tag{29}$$

which is valid again in a continuum approximation for $L \gg 1$. The three terms on the
r.h.s. describe, in order, the transport of the composition bias due to the exponential
dilatation of the sequence, its dissipative decay, and its stochastic fluctuations. Notice
that the last two terms are caused by the same basic mutation process and are therefore
both proportional to $\mu_{\rm eff}$.

We limit ourselves here to evaluating the equilibrium distribution $P(m, L)$
asymptotically for large values of $L$. The solution of (29) defines different parameter
regimes:

(i) Strong-correlation regime ($\chi < 1/2$): The large-$L$ asymptotics is determined by
balancing dilatation and deterministic decay, i.e., the first two terms on the r.h.s. of
equation (29). For this regime, we obtain

$$P(m, L) = L^{\chi}\, \mathcal{P}(x) \quad \text{with} \quad x = m L^{\chi}, \tag{30}$$

where $\mathcal{P}(x)$ is a scaling function (whose form is determined by the stochastic
dynamics on smaller scales). We can verify the consistency of the solution (30)
by checking that the third term on the r.h.s. of (29) gives a contribution which is
subleading by a factor $L^{2\chi - 1}$ for large $L$. This result is also verified by our numerics,
as shown in figures 6(a,b), where we present measured distributions $P(m, L)$ and
the collapse onto one scaling function $\mathcal{P}(x)$. Obviously, the scaling of $P(m, L)$ also
determines the scaling of its moments $\langle m^k \rangle(L) = \int m^k P(m, L)\, dm$,

$$\langle m^k \rangle(L) \propto L^{-k\chi}. \tag{31}$$

This is consistent with the scaling of the one-point and two-point functions
obtained in equations (14) and (27).

Figure 6. Numerically measured distribution functions $P(m, L)$ and the
corresponding scaling functions $\mathcal{P}(x)$ for $L = 10^{?}, 10^{?}, 10^{?}$. (a,b) Regime (i)
with $\chi = 0.1$ and $\mathcal{P}(x) = L^{-0.1} P(L^{-0.1} x, L)$. (c,d) Regime (ii) with $\chi = 1.0$
and the Gaussian scaling function $\mathcal{P}(x) = L^{-1/2} P(L^{-1/2} x, L)$. The deviations
for $L = 10^{?}$ in both regimes are due to the fact that the analytic asymptotics
is only valid for large $L$. The ensemble averages were obtained by averaging
over $10^{?}$ sequence realizations for each parameter setting with random initial
conditions, resulting in symmetric distributions (only positive values shown).


(ii) Weak-correlation regime ($\chi > 1/2$): Equation (29) has an exact solution of
Gaussian form,

$$P(m, L) = \sqrt{\frac{L}{2\pi\, \xi(\chi)}}\, \exp\left[ -\frac{(m - m_0 L^{-\chi})^2 L}{2\, \xi(\chi)} \right] \quad \text{with} \quad \xi(\chi) = \frac{\chi}{\chi - 1/2}. \tag{32}$$

This solution has the expectation value

$$\langle m \rangle(L) = m_0 L^{-\chi} \tag{33}$$

(with the coefficient $m_0$ determined by the initial condition) and the variance

$$\langle m^2 \rangle(L) - \langle m \rangle^2(L) = \frac{\xi(\chi)}{L}. \tag{34}$$

It is thus of similar form as the simple fluctuation-dissipation equilibrium
$P \propto \exp[-m^2 L/2]$ for $\lambda = 0$, obtained from the last two terms on the r.h.s. of (29).
The transport term generates an additional length scale $\xi$, since individual sites
are not completely independent of each other but are strongly correlated on
scales smaller than $\xi$ due to duplications. This reduces the number of effectively
independent fluctuating sequence segments to $L/\xi$. Numerical measurements of the
distribution $P(m, L)$ in this regime for random initial conditions ($m_0 = 0$) and the
corresponding scaling function $\mathcal{P}(x) \propto \exp[-x^2/2\xi(\chi)]$ with $x = m L^{1/2}$ are shown
in figures 6(c,d).
(iii) Transition point ($\chi = 1/2$): The solution of (29) is still of Gaussian form, [...]
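The three regimes can be traced to a single equation for the variance. The derivation below is a sketch under the assumption that (29) has the drift term $2\mu_{\rm eff}\, \partial_m [m P]$ and the diffusion term $(2\mu_{\rm eff}/L)\, \partial_m^2 P$; the shorthand $v(L)$ for the variance and the integration constant $c$ are introduced here and do not appear in the article.

```latex
% Stationary first and second moments of (29); v(L) = <m^2>(L) - <m>^2(L).
\begin{align*}
0 &= -\lambda L \frac{d\langle m\rangle}{dL} - 2\mu_{\rm eff}\,\langle m\rangle
  &&\Longrightarrow\quad \langle m\rangle(L) = m_0 L^{-\chi}, \qquad \chi = \frac{2\mu_{\rm eff}}{\lambda}, \\
0 &= -\lambda L \frac{dv}{dL} - 4\mu_{\rm eff}\, v + \frac{4\mu_{\rm eff}}{L}
  &&\Longrightarrow\quad L \frac{dv}{dL} + 2\chi\, v = \frac{2\chi}{L}, \\
v(L) &= \frac{\chi}{\chi - 1/2}\,\frac{1}{L} + c\, L^{-2\chi} && (\chi \neq 1/2), \\
v(L) &= \frac{\ln L + c}{L} && (\chi = 1/2).
\end{align*}
```

For $\chi > 1/2$ the $1/L$ term dominates at large $L$, which corresponds to a variance $\xi(\chi)/L$ with $\xi(\chi) = \chi/(\chi - 1/2)$. For $\chi < 1/2$ the term $c\, L^{-2\chi}$ dominates, in line with the scaling $\langle m^k \rangle \propto L^{-k\chi}$ of regime (i). At $\chi = 1/2$ the same equation gives a logarithmic correction to the $1/L$ law.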



The existence of two different scaling regimes has direct consequences for the
detectability of correlations from data of a single sequence on large scales. In the
strong-correlation regime ($\chi < 1/2$), the composition bias on arbitrarily large scales $L$
is determined primarily by the ancestral bias, while the mutational fluctuations can be
neglected asymptotically. In the weak-correlation regime, the ancestral bias can only
be detected on scales $L < L^*$, while the mutational noise is dominant on larger scales.
The scale $L^*$ can be estimated by equating the average $\langle m \rangle(L^*)$ with the rms deviation
$\langle (m - \langle m \rangle)^2 \rangle(L^*)^{1/2}$ given by equations (33) and (34).
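Written out, this estimate is a one-line calculation. The sketch below assumes the mean $\langle m \rangle(L) = m_0 L^{-\chi}$ and the variance $\xi(\chi)/L$ for the weak-correlation regime; the closed expression for $L^*$ follows from these two inputs alone.

```latex
% Equate the mean with the rms deviation at L = L^* (weak-correlation regime, \chi > 1/2).
\begin{align*}
m_0\, (L^*)^{-\chi} = \sqrt{\frac{\xi(\chi)}{L^*}}
\quad\Longrightarrow\quad
(L^*)^{\chi - 1/2} = \frac{m_0}{\sqrt{\xi(\chi)}}
\quad\Longrightarrow\quad
L^* = \left( \frac{m_0^2}{\xi(\chi)} \right)^{1/(2\chi - 1)} .
\end{align*}
```

Under these assumptions $L^*$ exceeds one only if $m_0^2 > \xi(\chi)$, and the exponent $1/(2\chi - 1)$ diverges as $\chi \to 1/2$ from above, where the estimate ceases to define a finite scale.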

The difference between the strong- and weak-correlation regimes is illustrated in
figure 7, where we show two single sequences generated from an ancestor letter $+1$. In
the strong-correlation regime, the entire sequence has a detectable bias towards $+1$, with
islands of $-1$ tracing back to their ancestors generated by mutation events (figure 7(a)).
In the weak-correlation regime, the sequence is seen to consist of strongly correlated
segments of length $\sim 5$, but it looks random on larger scales (figure 7(b)).

We stress again that the existence of two different scaling regimes with a transition
at $\chi = 1/2$ is a feature of the full distribution $P(m, L)$ in the asymptotic regime $L \gg 1$.
Expectation values such as the composition bias (14) and the correlation function (27)
have a universal form in both regimes and no transition at $\chi = 1/2$.

6. Model extensions and symmetry breaking

6.1. Biased insertions

In the following, we investigate a generalization of the dynamical model and thereby
demonstrate the universality of our approach. For simplicity, we start with a
single-letter model ($\ell_{\max} = 1$). In the original model of section 2, random
insertions were defined as the insertion of random letters $x = \pm 1$ at position $k + 1$,
independently of the preceding sequence element $s_k$; we now consider
biased insertions. This extension is biologically well motivated, since there is ample
evidence by now that the rates of segmental insertions into the genome, e.g. those
of interspersed repeats, are biased by the local GC-content of the genomic region [13].
Formally, the biased insertion process in our model is defined by




(35) 




insertion rate r], 



(36) 
[Figure 7: two example sequences of the letters + and −, shown in panels (a) and (b).]

Figure 7. A single sequence of length $N = 400$ generated by the expansion-randomization
process from an initial letter $+1$. (a) Strong-correlation regime
($\mu = 0.5$, $\delta_1 = 10.0$, i.e. $\chi = 0.1 < 1/2$): The sequence retains a net
composition bias towards $+1$ over its entire length, i.e., the initial composition
bias is detectable. Minority islands of $-1$ are found on all scales. (b) Weak-correlation
regime ($\mu = 0.5$, $\delta_1 = 1.0$, i.e. $\chi = 1.0 > 1/2$): The sequence
consists of strongly correlated islands of length $\xi \sim 5$ but looks random on
larger scales. The initial composition bias is not detectable.



where $y[s]$ denotes a randomly chosen letter $y[s] = \pm 1$ with an average bias depending
on the value of the preceding sequence element $s$,

$$\langle y[s] \rangle = \nu s, \qquad \nu \in [-1, 1]. \tag{37}$$

The degree of dependence can thereby be tuned by the parameter $\nu$. In fact, the random
insertions of the original model are the special case $\nu = 0$ of this generalized process,
while $\nu = 1$ corresponds to duplications.
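The biased letter of equation (37) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names `biased_letter` and `insert_biased` are hypothetical, and the choice $P(y = s) = (1 + \nu)/2$ is simply the two-valued distribution whose mean is $\nu s$.

```python
import random

def biased_letter(s, nu, rng=random):
    """Draw y[s] = +/-1 with mean nu * s (hypothetical helper)."""
    if s not in (-1, 1):
        raise ValueError("s must be +1 or -1")
    if not -1.0 <= nu <= 1.0:
        raise ValueError("nu must lie in [-1, 1]")
    # P(y = s) = (1 + nu) / 2 gives <y> = s*(1 + nu)/2 - s*(1 - nu)/2 = nu * s.
    return s if rng.random() < (1.0 + nu) / 2.0 else -s

def insert_biased(seq, k, nu, rng=random):
    """Insert y[seq[k]] directly after index k (0-based), in place."""
    seq.insert(k + 1, biased_letter(seq[k], nu, rng))
    return seq
```

With `nu = 0` the inserted letter is unbiased (a random insertion), and with `nu = 1` it always copies the preceding letter (a single-letter duplication), matching the two limiting cases named in the text.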

The contributions of this process to the dynamics of the joint probabilities
$P_{\rm eq/op}(r,t)$ can still be calculated exactly. Terms (16a) and (16e)-(16h) are not
affected, since the biased insertion process changes neither the effect of single-site
mutations nor the "shift" and "transport" of joint probability. However, an additional
multiplicative factor $(1 - \nu)$ has to be incorporated in (16b) and (16c), while the effects
on (16d) are described by an additional factor $(1 + \nu)$. In the Master equation
for $C(r)$ in the continuum limit (26), this biased insertion process therefore does not
affect the asymptotic growth rate $\lambda$, but the effective mutation rate is now given by

$$\mu_{\rm eff} = \mu + \tfrac{1}{2}(1 - \nu)\,\eta. \tag{38}$$

We briefly mention that the biased insertion of single letters can generically
be extended to the biased insertion of segments $(y_1[s], \ldots, y_\ell[s])$ at a rate $\eta_\ell$ with an average bias
of their elements $\langle y_i[s] \rangle = \nu_\ell s$. In this case, one might actually have $\nu_\ell = \nu(\ell)$, and
asymptotically for the effective mutation rate we obtain

t^es = + ^'^{'^ - J^e)ve- (39) 
i=i 

6.2. Biased mutations and symmetry breaking 

The model considered so far was symmetric under $s_k \to -s_k$, i.e., the rates of all
processes were independent of $s_k$. However, it is known that this symmetry does not hold
for genomic evolution. For example, distinct mutation rates of different nucleotides lead
to the unequal frequencies of the four nucleotides along genomic DNA [14].
In the following we show that the restriction to symmetric processes is not crucial
for the emergence of long-range correlations and the universal scaling of the
generated sequences. A simple scenario breaking the model's $Z_2$ symmetry is the choice
of asymmetric mutation rates,

$$(\ldots, +1, \ldots) \to (\ldots, -1, \ldots) \qquad \text{rate } \mu^+, \tag{40a}$$

$$(\ldots, -1, \ldots) \to (\ldots, +1, \ldots) \qquad \text{rate } \mu^-, \tag{40b}$$

with $\mu^+ \neq \mu^-$. In this case, the Master equations of the probabilities $P_k^\pm(t)$ are

r\ -^max 

-P±(t) = ± fi-p^ T t^^Pt + E - 1' ^) ( V2 - Pt) 

£=1 

(^max \ 
E Pk-,e'Pkl (41) 

and we have already shown above that asymptotically $P_k^\pm$ is independent of $k$ if
all sequence sites $s_k$ are allowed to mutate. Thus, for the asymptotic stationary average
composition bias $\langle s_k \rangle = P^+ - P^-$ in the asymmetric mutation model we obtain

$$\langle s_k \rangle = \frac{\mu^- - \mu^+}{\mu^+ + \mu^- + \gamma^+_{\rm eff}}. \tag{42}$$

Concerning the dynamics of the joint probabilities $P_{\rm eq/op}(r,t)$, the introduction of
asymmetric mutation rates only changes the mutational term, while the contributions
of duplications, random insertions, and deletions are not affected. In the asymmetric
model, the Master equations for $P_{\rm eq/op}(r,t)$ are now given by

$$\frac{\partial}{\partial t} P_{\rm eq}(r,t) = +(\mu^+ + \mu^-)\,P_{\rm op}(r) - 2\mu^+ P^{++}(r) - 2\mu^- P^{--}(r) + Q_{\rm eq}(r,t), \tag{43a}$$

$$\frac{\partial}{\partial t} P_{\rm op}(r,t) = -(\mu^+ + \mu^-)\,P_{\rm op}(r) + 2\mu^+ P^{++}(r) + 2\mu^- P^{--}(r) + Q_{\rm op}(r,t), \tag{43b}$$

where $P^{++}(r)$ and $P^{--}(r)$ are the joint probabilities of simultaneously finding $s_k = s_{k+r} = +1$
and $s_k = s_{k+r} = -1$, respectively. $Q_{\rm eq}(r,t)$ denotes the terms (16b)-(16h) with the
$k$-dependence of $P_{\rm eq/op}(r,t)$ already dropped, while $Q_{\rm op}(r,t)$ is obtained by exchanging
$P_{\rm eq}$ and $P_{\rm op}$. The dynamics of $C(r,t)$ in the asymmetric model is therefore

$$\frac{\partial}{\partial t} C(r,t) = -2\left(\mu^+ + \mu^- + \gamma^+_{\rm eff}\right)\left[C(r) - \langle s_k \rangle^2\right] + \left[Q_{\rm eq}(r,t) - Q_{\rm op}(r,t)\right], \tag{44}$$

where we used equation (42) and $\langle s_k \rangle = P^+ - P^- = P^{++}(r) + P^{+-}(r) - P^{-+}(r) - P^{--}(r)$ with
$P^{+-}(r) = P^{-+}(r)$. Defining the effective mutation rate of the asymmetric model,

$$\mu_{\rm eff} = \tfrac{1}{2}\left(\mu^+ + \mu^- + \gamma^+_{\rm eff}\right), \tag{45}$$

the stationary solution of this dynamics in the continuum limit is now given by

$$C(r) \propto r^{-\alpha} + \langle s_k \rangle^2 \qquad \text{with} \qquad \alpha = 2\chi = \frac{4\mu_{\rm eff}}{\lambda}. \tag{46}$$

The magnitude of the segmental composition bias scales as

$$\langle |m(L)| \rangle \propto L^{-\chi} + \langle s_k \rangle. \tag{47}$$

Hence, breaking the $Z_2$ symmetry by introducing asymmetric mutation rates does
not change the long-range correlations and the general scaling of the model. It is
obvious from equations (46) and (47) that the scaling still holds for the connected
correlation function $C^c(r) = \langle s_k s_{k+r} \rangle - \langle s_k \rangle^2$ and the shifted segmental composition
bias $\langle | L^{-1} \sum_{k'=k}^{k+L-1} s_{k'} | \rangle - \langle s_k \rangle$.
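The step from the Master equations (43a) and (43b) to the dynamics of $C(r,t)$ can be made explicit for the mutational part. The following worked calculation is a sketch under two assumptions taken from the usual definitions rather than from this section: $C = P_{\rm eq} - P_{\rm op}$ and $P_{\rm eq} + P_{\rm op} = 1$. It also uses $P^{++} + P^{--} = P_{\rm eq}$ and $P^{++} - P^{--} = \langle s_k \rangle$, which follows from $P^{+-}(r) = P^{-+}(r)$.

```latex
\begin{align*}
\partial_t C(r,t) &= \partial_t P_{\rm eq}(r,t) - \partial_t P_{\rm op}(r,t) \\
 &= 2(\mu^+ + \mu^-)\,P_{\rm op}(r) - 4\mu^+ P^{++}(r) - 4\mu^- P^{--}(r)
    + \left[ Q_{\rm eq} - Q_{\rm op} \right], \\
\intertext{and, inserting $P_{\rm op} = \tfrac12 (1 - C)$,
 $P^{++} = \tfrac14 (1 + C) + \tfrac12 \langle s_k \rangle$ and
 $P^{--} = \tfrac14 (1 + C) - \tfrac12 \langle s_k \rangle$,}
\partial_t C(r,t) &= -2(\mu^+ + \mu^-)\,C(r) - 2(\mu^+ - \mu^-)\,\langle s_k \rangle
    + \left[ Q_{\rm eq} - Q_{\rm op} \right].
\end{align*}
```

The term linear in $\langle s_k \rangle$ vanishes for $\mu^+ = \mu^-$. For asymmetric rates it is the source of the constant offset proportional to $\langle s_k \rangle^2$ once the stationary composition bias is inserted.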



6.3. Universality 

The structure of equation (26) reveals the basic mechanisms generating long-range
correlations in a very general class of expansion-randomization systems that share
three fundamental characteristics of their dynamics. The first feature is an overall
exponential expansion of the system, transporting correlations from shorter to larger
sequence distances (combined effects of duplications, insertions, and deletions in our
model). Mathematically, this transport is described by a dilatation operator $r\,\partial/\partial r$
(second term on the r.h.s. of (26)). On the other hand, all correlations are counteracted
by local processes randomizing the sequence (mutations) and therefore tending to diminish
$C(r)$ (first term of (26)). The competition between expansion and randomization results
in an algebraically decaying $C(r) \propto r^{-\alpha}$ in the stationary state, with $\alpha$ determined by a
simple ratio of effective growth rate and effective mutation rate. Calculating these
two fundamental parameters for any set of processes constituting such a system determines
the large-distance asymptotics of the correlations in the generated sequences. However,
$C(r) = 0$ for all $r$ is also a stationary solution of equation (26). Hence, in order for
long-range correlations to be established, a third necessary feature of such systems is the
presence of a mechanism continuously producing correlations on short scales. These serve
as an ongoing reservoir for the transport of correlations to larger sequence distances and
ensure the existence of a non-zero value $C(r_0) > 0$ for a specific $r_0 \ge 1$ (in our model,
these initial correlations on short scales are produced by duplications). As an intuitive
example of the necessity of this third condition, consider an expansion-randomization
system with mutations and insertions of single random letters, but no duplications. This
system features exponential expansion as well as local randomization. But the insertion
process is not capable of producing $C(1) > 0$, and therefore no long-range correlations
can be established in the generated sequences.

As expected from standard scaling theory, the decay exponent of the two-point function is
twice that of the one-point function. The value $\chi$
can be interpreted as the scaling dimension of the variable $s_k$ in this universality class.
There is a one-parameter family of decay exponents as, for example, in the Gaussian model
in two dimensions. This universal behavior is unaffected by the breakdown of the $Z_2$
symmetry, which manifests itself only in the non-universal constants in (46) and (47).
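The competition between expansion and randomization described in this section can be illustrated by a short worked equation. The continuum form written below is an assumption: it is our reading of the two terms named in the text (a mutation term and a dilatation term), not a reproduction of the Master equation itself.

```latex
% Assumed continuum form: mutation term first, dilatation term second.
\begin{align*}
\partial_t C(r,t) &= -4\mu_{\rm eff}\,C(r,t) - \lambda\, r\,\partial_r C(r,t), \\
\partial_t C = 0 \;\Rightarrow\; r\,\frac{dC}{dr} &= -\frac{4\mu_{\rm eff}}{\lambda}\,C
 \;\Rightarrow\; C(r) = C(r_0) \left( \frac{r}{r_0} \right)^{-\alpha},
 \qquad \alpha = \frac{4\mu_{\rm eff}}{\lambda} = 2\chi .
\end{align*}
```

The amplitude $C(r_0)$ is not fixed by this equation, which is why a short-scale source of correlations is needed: with $C(r_0) = 0$ the same solution is identically zero. As a numerical check against the values quoted for figure 7(a), $\mu_{\rm eff} = 0.5$ and $\lambda = 10$ give $\alpha = 0.2$, i.e. $\chi = 0.1$.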



6.4. Numerical Analysis

Numerical simulation of the stochastic sequence dynamics was implemented using a
Monte Carlo procedure. During each discrete time step

$$\Delta t = \epsilon \cdot \Big[ \Big( \mu + \sum_\ell \big[ \delta_\ell + \gamma^+_\ell + \gamma^-_\ell \big] \Big) N(t) \Big]^{-1} \tag{48}$$

with a tunable parameter $\epsilon \le 1$, we choose a random site and randomly let a process
act on it. The probability $P_a$ of a process $a$ being executed on the drawn site is

$$P_a = \mathrm{rate}(a) \cdot \Delta t. \tag{49}$$

The overall probability of executing any process on the drawn site therefore depends on
the parameter $\epsilon$. While $\epsilon = 1$ ensures that exactly one process is executed, for small $\epsilon$
no process will be chosen to act on the drawn site in most
cases. We use $\epsilon = 0.1$ for our numerical simulations.
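The procedure above can be sketched for the single-letter model as follows. This is an illustrative reading, not the authors' code: the function names, the dictionary keys and the guard against deleting the last remaining letter are assumptions, and the per-step probability of executing process $a$ is taken to be $\epsilon \cdot \mathrm{rate}(a)$ divided by the sum of all rates, so that the probabilities add up to $\epsilon$.

```python
import random

def mc_step(seq, rates, eps, rng):
    """One discrete step on a +/-1 sequence; returns the time increment."""
    total = (rates["mu"] + rates["delta"]
             + rates["gamma_plus"] + rates["gamma_minus"])
    dt = eps / (total * len(seq))       # time step for single-letter processes
    k = rng.randrange(len(seq))         # randomly drawn site (0-based)
    u = rng.random() * total / eps      # u < total happens with probability eps
    if u < rates["mu"]:
        seq[k] = -seq[k]                            # mutation
    elif u < rates["mu"] + rates["delta"]:
        seq.insert(k + 1, seq[k])                   # duplication
    elif u < rates["mu"] + rates["delta"] + rates["gamma_plus"]:
        seq.insert(k + 1, rng.choice((-1, 1)))      # random insertion
    elif u < total and len(seq) > 1:
        del seq[k]                                  # deletion (never empties seq)
    return dt

def simulate(n_target, rates, eps=0.1, seed=0):
    """Grow a sequence from the single letter +1 until it has n_target letters."""
    rng = random.Random(seed)
    seq, t = [1], 0.0
    while len(seq) < n_target:
        t += mc_step(seq, rates, eps, rng)
    return seq, t
```

For example, `simulate(400, {"mu": 0.5, "delta": 10.0, "gamma_plus": 0.0, "gamma_minus": 0.0})` produces one sequence with the rates of the strong-correlation regime. The loop only terminates if the net growth rate is positive, and Python list insertions cost time linear in the sequence length, so this sketch is meant for short sequences.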

For a single realization of the stochastic dynamics, the average segmental
composition bias $\langle |m(L)| \rangle$ and the correlation function $C(r)$ are well approximated by
sequence averages,

$$\langle |m(L)| \rangle \approx \frac{1}{N-L} \sum_{k=1}^{N-L} \left| \frac{1}{L} \sum_{k'=k}^{k+L-1} s_{k'} \right|, \tag{50}$$

$$C(r) \approx \frac{1}{N-r} \sum_{k=1}^{N-r} s_k s_{k+r}, \tag{51}$$

for sufficiently small values of $r$ and $L$ to allow efficient averaging. Averaging over 100
sequence realizations reduces the noise further and produces very accurate measurements
of $\langle |m| \rangle(L)$ and $C(r)$.
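The two sequence averages can be computed directly with NumPy. The sketch below follows the sums over $k = 1, \ldots, N-L$ and $k = 1, \ldots, N-r$ described above; the function names are hypothetical, and a cumulative sum is used so that all windows of length $L$ are evaluated in one pass.

```python
import numpy as np

def composition_bias(seq, L):
    """Sequence average of |m(L)| over the N - L windows starting at k = 1..N-L."""
    s = np.asarray(seq, dtype=float)
    N = s.size
    if not 1 <= L < N:
        raise ValueError("need 1 <= L < N")
    csum = np.concatenate(([0.0], np.cumsum(s)))
    window_sums = csum[L:N] - csum[0:N - L]   # sum of s[i : i + L] for i = 0..N-L-1
    return float(np.mean(np.abs(window_sums / L)))

def correlation(seq, r):
    """Sequence average of s_k * s_{k+r} over k = 1..N-r."""
    s = np.asarray(seq, dtype=float)
    N = s.size
    if not 1 <= r < N:
        raise ValueError("need 1 <= r < N")
    return float(np.mean(s[:N - r] * s[r:]))
```

Averaging the returned values over independent realizations then reduces the noise as described in the text.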

If the dynamics obeys $Z_2$ symmetry, we can directly infer the decay exponent $\alpha$ from
these measurements using the power laws of the symmetric model. However, if the $Z_2$ symmetry
is violated, these power laws have to be disentangled from the additional constants
$\langle s_k \rangle$ and $\langle s_k \rangle^2$, respectively; see equations (47) and (46). If the microscopic processes are
known, these non-universal constants can be calculated. A numerical problem arises,
however, in the analysis of genomic DNA sequences, where the $Z_2$ symmetry is broken
by an unknown amount. In that case, we can self-consistently fit the data in the form
$\langle |m| \rangle(L) = a L^{-\chi} + c$ and $C(r) = b r^{-2\chi} + c^2$. Hence, the link between the finite-size scaling
of $\langle |m| \rangle(L)$ and the scaling of the correlation function $C(r)$ dictated by universality is
of practical importance for data analysis. In particular, it is not justified in general
to approximate the constant $c$ by $\frac{1}{N} \sum_{k=1}^{N} s_k$ for sequences of finite length $N$ in the
strong-correlation regime $\chi < 1/2$, as is often done in the literature. Furthermore, we
can check consistency with the exponent $\beta = 1 - 2\chi$ of the GC power spectrum. Power
spectra can easily be obtained using the Fast Fourier Transform algorithm pEj . 
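One possible way to carry out the self-consistent fit described above is a grid search over the exponent combined with linear least squares. This scheme is an assumption chosen for illustration, not the authors' fitting procedure: for each trial $\chi$, the model $a L^{-\chi} + c$ is linear in $(a, c)$, and once $c$ is known the model $b r^{-2\chi} + c^2$ is linear in $b$. The name `joint_fit` and the default grid are hypothetical.

```python
import numpy as np

def joint_fit(L, m, r, C, chi_grid=None):
    """Fit m(L) = a*L**(-chi) + c and C(r) = b*r**(-2*chi) + c**2 with shared chi, c."""
    L, m, r, C = (np.asarray(x, dtype=float) for x in (L, m, r, C))
    if chi_grid is None:
        chi_grid = np.linspace(0.01, 1.5, 150)
    best = None
    for chi in chi_grid:
        # For fixed chi the bias model is linear in (a, c).
        A = np.column_stack((L ** (-chi), np.ones_like(L)))
        (a, c), *_ = np.linalg.lstsq(A, m, rcond=None)
        # With c fixed, the correlation model is linear in b.
        x = r ** (-2.0 * chi)
        b = np.dot(x, C - c ** 2) / np.dot(x, x)
        cost = (np.sum((a * L ** (-chi) + c - m) ** 2)
                + np.sum((b * x + c ** 2 - C) ** 2))
        if best is None or cost < best[0]:
            best = (cost, chi, a, b, c)
    _, chi, a, b, c = best
    return {"chi": float(chi), "a": float(a), "b": float(b), "c": float(c)}
```

The two residual sums are added with equal weight here. With real data one would probably weight them by the measurement errors of $\langle |m| \rangle(L)$ and $C(r)$.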

7. Dynamical correlations 

7.1. Correlation build-up 

Up to now, results for the correlations $C(r)$ in our model have only been obtained for
the stationary state, reached in the limit $t \to \infty$. We now take a closer look at the
dynamical aspects of the build-up of correlations in growing sequences. Starting with a
sequence $S(t = 0) = (x)$, where $x = \pm 1$ denotes a uniformly distributed random letter,
the correlations are found to be present from the beginning. Figure 8(a) gives examples
for C{r) measured along short single sequence realizations of length N{t) = 10^, 10^, 
and 10^. 

But of course, correlations cannot be present right from the beginning on all scales if
we use as initial condition a sequence $S(t = 0) = (s_1, \ldots, s_{N_0})$ of length $N_0 > 1$ whose
letters are randomly chosen (and thus uncorrelated). All the processes of our model
are local: a single step can introduce correlations only up to a microscopic
length scale $\ell_{\max}$. Thus, there is a cutoff length $r^*(t)$ up to which correlations
can have been established at time $t > 0$. It is determined by the average distance by which two
copies of a duplication event at $t = 0$ are separated from each other along the sequence
at time $t$. Therefore we have

$$r^*(t) = \ell_{\max} \exp(\lambda t). \tag{52}$$

Figure 8(b) shows that $r^*(t)$ marks the range where $C(r)$ starts to deviate significantly
from its stationary form.
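The cutoff length of equation (52) and its inverse are easy to evaluate numerically. The helper names below are hypothetical, and `growth_rate` stands for the effective growth rate $\lambda$ of the model.

```python
import math

def cutoff_length(t, growth_rate, l_max=1):
    """Cutoff r*(t) = l_max * exp(lambda * t) up to which correlations can exist."""
    return l_max * math.exp(growth_rate * t)

def time_to_reach(r, growth_rate, l_max=1):
    """Invert the cutoff formula: time at which r*(t) reaches the distance r."""
    if growth_rate <= 0:
        raise ValueError("the cutoff only grows for a positive growth rate")
    if r < l_max:
        raise ValueError("r must be at least l_max")
    return math.log(r / l_max) / growth_rate
```

Because the cutoff grows exponentially, the time needed to establish correlations up to a distance $r$ grows only logarithmically with $r$.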

7.2. Distinct dynamical regimes and correlation decay 

There is ample evidence that the rates of local evolutionary processes are not constant
in time. We mimic this non-stationarity of the individual process rates by a
succession of several distinct dynamical phases. For each individual phase $n$, the rates
of the elementary processes are constant during the time interval $t_{n-1} < t < t_n$ and
result in specific values of $\lambda^{(n)}$ and $\mu^{(n)}_{\rm eff}$ for that particular phase. Between different
phases, however, the complete set of rates may change,

phase 1: $(\mu^{(1)}, \delta_\ell^{(1)}, \ldots)$ for $t_0 < t < t_1$
phase 2: $(\mu^{(2)}, \delta_\ell^{(2)}, \ldots)$ for $t_1 < t < t_2$
...
phase n: $(\mu^{(n)}, \delta_\ell^{(n)}, \ldots)$ for $t_{n-1} < t < t_n$ (53)

Using the findings of section 7.1, we can generalize our dynamics to rates that vary
during sequence evolution. We start with the following simple two-stage scenario:



Figure 8. Time-dependent correlations $C(r,t)$ as a function of the length $r$. (a) Build-up of long-range
correlations by stationary growth. Measured $C(r,t)$ at three intermediate
lengths $N(t)$ (symbols) together with the stationary form (23)
for $\mu = 1.0$, $\delta_1 = 8.0$ (line). (b) Correlation build-up from a random sequence
of length $N_0$. At $t = 0$ the processes started acting on the sequence
with rates $\mu = 1.0$, $\delta_1 = 10.0$. Measured $C(r,t)$ (symbols) of the simulated
sequences after various times $t$ (averages over 100 realizations). Black crosses
denote the corresponding mean sizes $r^*(t) = \exp(\lambda t)$. Correlations have been
established in the sequences according to their analytic stationary form (red
line) in the regime $r < r^*(t)$, while they vanish for $r > r^*(t)$.



sequence growth with rate $\lambda^{(1)} > 0$ for $0 < t < t_1$, followed by a second phase with
$\lambda^{(2)} = 0$ and therefore $\langle N \rangle(t) = \langle N \rangle(t_1)$ for $t > t_1$. It is obvious from equation (26) that
stationary long-range correlations only emerge as long as the sequence grows, i.e. for
$\lambda^{(n)} > 0$. The time-dependent solution of (26) for the asymptotics of $C(r)$ during the
second phase ($t > t_1$) then takes the form

$$C(r, t) = C(r, t_1)\, e^{-4\mu^{(2)}_{\rm eff} \Delta t} \propto r^{-4\mu^{(1)}_{\rm eff}/\lambda^{(1)}}\, e^{-4\mu^{(2)}_{\rm eff} \Delta t} \tag{54}$$

with $\Delta t = t - t_1$. Thus, the long-range tails of the correlations established during the
first phase are preserved in the second phase, but their amplitude decays exponentially
with a characteristic time scale $\tau = (4\mu^{(2)}_{\rm eff})^{-1}$.

In the short-range part, however, correlations may still be present depending on
the particular set of process rates chosen to ensure $\lambda^{(2)} = 0$. If, for example, all rates
$\delta_\ell^{(2)}$, $\gamma_\ell^{+(2)}$, $\gamma_\ell^{-(2)}$ are zero in the second phase, the only process acting is mutation,
which exponentially destroys correlations uniformly along the sequence, and thus the
amplitude of $C(r)$ decays according to equation (54) for all lengths $r$. The situation
becomes more complex if $\lambda^{(2)} = 0$ is accomplished in the presence of duplications by
a compensatory increase of the deletion rate. In this case, the duplication process
keeps correlations present at short lengths, since there is always a finite probability that a
site $s_k$ recently originated by a duplication of $s_{k-1}$ (which again might be a duplication
of $s_{k-2}$, and so on) and has not yet been affected by a mutation event. Numerical results
for this type of two-phase dynamics are shown in figure 9(a), verifying the exponential
decay of the long-range tail predicted by equation (54).

Figure 9. (a) Decay of correlations during sequence evolution at stationary
length $N_0$. Measured $C(r,t)$ at various times $\Delta t$ (symbols) together with
the analytic decay of the long-range tail given by equation (54). In the previous
growth phase for $t < t_1$, correlations have been established by a single-letter
duplication-mutation dynamics with $\mu = 1.0$ and $\delta_1 = 8.0$ until the sequences
reached the length $N_0$. For $\Delta t = t - t_1 > 0$, a single-letter deletion process
with $\gamma_1^- = 8.0$ was introduced. Note that the correlations on short scales are
preserved during the second phase. (b) $C(r)$ with two scaling regimes 1 and 2
(symbols). Process rates are: $\mu^{(1)} = 1.0$, $\delta_1^{(1)} = 10.0$ and $\mu^{(2)} = 1.0$, $\delta_1^{(2)} = 2.0$.
The dashed red line is the analytical $C(r, t)$ for the parameters of phase 1. The
second phase lasted over a period of time that on average allowed the sequences
to increase their length by a factor of 100. For each scaling regime ($n = 1, 2$),
$C(r)$ obeys the predicted algebraic decay with exponent $\alpha^{(n)} = 4\mu^{(n)}_{\rm eff} / \lambda^{(n)}$. The
transition between both regimes is sharp and its position agrees with the
predicted value.

In a general evolutionary scenario, with several distinct dynamical phases and
arbitrary values of $\lambda^{(n)}$ and $\mu^{(n)}_{\rm eff}$ for each particular phase, the functional characteristics
of the correlations in the generated sequences are shaped by a combination of
correlation build-up and decay, according to the mechanisms revealed
above. During phase $n$ with $\lambda^{(n)} > 0$, correlations are established with $\alpha^{(n)} =
4\mu^{(n)}_{\rm eff}/\lambda^{(n)}$ and they approximately range over a length scale $r = 1, \ldots, r_{\max}$
with $r_{\max} = \exp(\lambda^{(n)} \Delta t_n)$. The correlations already present from the previous phases
are transported to larger sequence distances. If they ranged across an interval
$r = 1, \ldots, N(t_{n-1})$ at the end of phase $n - 1$, they are shifted to the interval
$r = N(t_{n-1}), \ldots, N(t_n)$ during phase $n$. The long-range tails, however, still obey
the exponent corresponding to the effective rates of the growth phase in which they
originated. In addition, they are subject to mutations, and
their amplitude therefore decays exponentially on all scales according to equation (54)
with the effective mutation rate $\mu^{(n)}_{\rm eff}$. A numerical example of a two-stage dynamics with
two distinct scaling regimes is shown in figure 9(b).
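The bookkeeping for a multi-phase scenario can be sketched as follows. For each phase the function returns the decay exponent established during that phase, the factor by which the sequence expands, and the factor by which the amplitude of older correlations decays. The function name and the dictionary layout are hypothetical. In the usage example, the identification $\lambda = \delta_1$ and $\mu_{\rm eff} = \mu$ for a single-letter duplication-mutation dynamics is an assumption, chosen because it is consistent with the parameter values quoted in the figure captions.

```python
import math

def phase_summary(phases):
    """Per-phase exponent, expansion factor and amplitude survival factor.

    phases: list of dicts with keys 'mu_eff', 'growth_rate' and 'duration'.
    """
    summary = []
    for p in phases:
        lam, mu_eff, dt = p["growth_rate"], p["mu_eff"], p["duration"]
        summary.append({
            # A stationary power law is only established while the sequence grows.
            "alpha": 4.0 * mu_eff / lam if lam > 0 else None,
            # Factor by which existing distances are stretched during the phase.
            "expansion": math.exp(lam * dt),
            # Factor by which previously established amplitudes decay.
            "survival": math.exp(-4.0 * mu_eff * dt),
        })
    return summary

# Two phases with the rates of the two-regime example; the duration of phase 1
# is arbitrary, that of phase 2 gives a hundredfold expansion.
example = phase_summary([
    {"mu_eff": 1.0, "growth_rate": 10.0, "duration": 1.0},
    {"mu_eff": 1.0, "growth_rate": 2.0, "duration": math.log(100.0) / 2.0},
])
```

In this example the exponents are 0.4 and 2.0. During the second phase the tail inherited from phase 1 is shifted outward by a factor of 100 while its amplitude drops by a factor of about $10^{-4}$.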
Given the chronology of the process rates for all phases, we can thus in principle
predict the different scaling regimes of the correlation function. Furthermore, given
the measured $C(r)$ of a sequence generated under the influence of our processes, we
might be able to reconstruct the chronology of the ratio of the effective rates $\lambda$ and
$\mu_{\rm eff}$ back through its evolutionary history. In practice, however, such an attempt
is limited by two major constraints. First, all of the above statements
apply only to the long-range tails of $C(r)$. Thus, in order to clearly identify the decay
exponent $\alpha$ of a certain rate regime, the net expansion during that regime must have
been sufficiently large. Second, since the correlations of the previous phases decay
exponentially with a time scale $\tau = (4\mu_{\rm eff})^{-1}$, the ratio $\lambda/\mu_{\rm eff}$ of the succeeding phases
should be high. Otherwise, previously established correlations rapidly decay below
the fluctuation threshold $\Delta C = 1/\sqrt{N(t)}$ and thus can no longer be measured.
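The second constraint can be turned into a rough estimate of how long an inherited correlation stays measurable. The sketch below assumes a constant sequence length $N$ during the later phase and a pure exponential decay with rate $4\mu_{\rm eff}$. Both are simplifications, and the function name is hypothetical.

```python
import math

def detectable_time(c0, n, mu_eff):
    """Time until an amplitude c0 decays below the threshold 1/sqrt(n)."""
    if mu_eff <= 0:
        raise ValueError("mu_eff must be positive")
    threshold = 1.0 / math.sqrt(n)
    if c0 <= threshold:
        return 0.0                      # already below the noise level
    return math.log(c0 / threshold) / (4.0 * mu_eff)
```

For instance, an amplitude of 0.1 in a sequence of $10^6$ letters with $\mu_{\rm eff} = 1$ remains above the threshold for a time $\ln(100)/4 \approx 1.15$.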

8. Discussion 

In this article, we have investigated a broad class of stochastic sequence evolution 
processes as possible causes of the observed long-range correlations in genomic DNA 
sequences. The emergence of such correlations is seen to be a robust feature of the 
entire class of models. They can be observed, e.g., in the two-point function and in 
the finite-size distribution of the composition bias. The power law behavior of these 
quantities is linked by a dynamical scaling theory. 

Clearly, further analysis of genomic data is needed to corroborate or refute possible 
causes of the observed correlations. Comparative genomics of closely related species 
is expected to offer a more detailed view on the elementary evolutionary processes 
shaping genomes. One has to keep in mind that genomic DNA is a highly heterogeneous 
environment ^Hj: it consists of genes, noncoding regions, repetitive elements etc., and 
all of these functional substructures may imprint their signature on the amount of 
correlations found in a particular genomic region. If a local expansion-randomization 
dynamics proves indeed responsible for these correlations, the universality established 
in this paper is crucial for the biological relevance. There is clearly a multitude of 
microscopic elementary processes, whose individual rates may be small and difficult to 
measure. These rates may vary across sequences, between species and between phases of 
evolutionary history. However, they enter the composition correlations in the mesoscopic 
range - for length scales between 10'^ and 10^ - only via two effective parameters, the 
effective growth rate and the effective mutation rate. It is this fact that provides an 
explanation for the ubiquity of long-range correlations and a way of testing the theory 
in a quantitative way. While the emergence of long-range tails appears to be universal, 
the decay exponent is not. This may also provide useful information on the expansion 
history of genomes. 

Biology has sometimes been characterized as a "science of exceptions". There is an
amazing diversity of biological species. Genomes encode that diversity, so the concept 
of universality, which has proved so successful in physics, would hardly seem to be 



Universality of Long-Range Correlations in Expansion- Randomization Systems 



23 



applicable to biology at first glance. However, this may well depend on the questions 
we ask, and even the above quote may have its exception. Genomic correlations could 
be an example of universality in evolutionary biology. 